statistics circle for analysing byte entropy in files

In a previous post I introduced triops, a multiplatform command-line encryption tool using the ChaCha20 algorithm. In general, any encryption algorithm should produce output with reasonably high entropy no matter what the input (plaintext) is. This keeps it aligned with concepts like entropic security and semantic security, which are very important in modern cryptography.

So, would triops’ outputs have high enough entropy? I developed a quick program to test this: in short, it reads all the bytes in a file and counts how many times each value from 0x00 to 0xff occurs. Then it calculates the standard deviation (sigma) of those 256 counts and arranges the 256 resulting buckets in a circle, in such a way that the distance from the centre is proportional to the byte value (with 0x00 at the centre, and …0xf0-0xff bytes the farthest from it). The character that represents each byte is chosen according to the deviation of its count from the mean, in fractions of the standard deviation.
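The counting step just described can be sketched in a few lines of Python. This is only an illustration, not the source of circle itself: the function names byte_counts and mean_and_sigma are hypothetical, and the population standard deviation is an assumption, since the post does not say which variant the program uses.

```python
import math
from collections import Counter


def byte_counts(data: bytes) -> list[int]:
    """Return 256 counters, one per byte value from 0x00 to 0xff."""
    seen = Counter(data)  # iterating over bytes yields ints 0..255
    return [seen.get(value, 0) for value in range(256)]


def mean_and_sigma(counts: list[int]) -> tuple[float, float]:
    """Mean and population standard deviation of the counters (assumed variant)."""
    mean = sum(counts) / len(counts)
    variance = sum((c - mean) ** 2 for c in counts) / len(counts)
    return mean, math.sqrt(variance)


if __name__ == "__main__":
    sample = b"hello world"
    counts = byte_counts(sample)
    mean, sigma = mean_and_sigma(counts)
    print(f"mean={mean:.4f} sigma={sigma:.4f}")
```

A perfectly flat file, where every byte value occurs equally often, gives sigma = 0. The more uneven the counts, the larger sigma becomes.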

The source code can be found on GitHub, and executables for various operating systems are available here.

A picture is worth a thousand words:

circle-gplv3-encrypted

In this case the circle represents the deviation from the mean, measured in fractions of sigma, in the encrypted file gplv3.txt.$#3, which contained just readable text before being encrypted with triops. Green represents deviations above the mean and red below it, with the size of the character directly proportional to the absolute magnitude of the deviation in fractions of sigma: from “,” representing 1/4 of sigma to “@” representing 9/4 = 2.25 sigma or above. The complete list of characters, indexed from 0 to 9, is:

{ '.', ',', '', '~', '+', '*', 'o', 'O', '#', '@' };
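One way to read that table: divide the absolute deviation by a quarter of sigma and use the integer part, capped at 9, as the index into the list. A minimal Python sketch of that reading follows. The name glyph_for is hypothetical, truncating instead of rounding is an assumption, and so is treating a count exactly equal to the mean as green. The glyph at index 2 did not survive in the list above, so '-' is used purely as a placeholder.

```python
# Glyphs indexed 0..9. Index 2 is a placeholder ('-'): that character
# is missing from the list in the post.
GLYPHS = ['.', ',', '-', '~', '+', '*', 'o', 'O', '#', '@']


def glyph_for(count: float, mean: float, sigma: float) -> tuple[str, str]:
    """Return (glyph, colour) for one byte counter.

    The glyph index is the absolute deviation in quarter-sigma steps,
    capped at 9 (2.25 sigma or more). Colours follow the post: green
    above the mean, red below it, blue for bytes absent from the file.
    """
    if count == 0:
        colour = "blue"
    elif count >= mean:
        colour = "green"  # assumption: a count equal to the mean is green
    else:
        colour = "red"
    if sigma == 0:
        return GLYPHS[0], colour  # no spread at all: smallest glyph
    steps = int(abs(count - mean) / (sigma / 4))
    return GLYPHS[min(steps, len(GLYPHS) - 1)], colour
```

For example, with mean 100 and sigma 8, a counter of 102 sits one quarter-sigma step above the mean and maps to a green “,”, while a counter of 82 is 2.25 sigma below and maps to a red “@”.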

As observed, the coefficient of variation (CV) is low, as expected from random-looking data, even though the sample size is relatively small in this case: 35000 bytes. CV is a good measurement here: it is dimensionless, and will allow us to compare entropy across files of different sizes.
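As a rough yardstick for reading CV values, here is a Python sketch with hypothetical function names. It computes CV as sigma divided by the mean of the 256 counters, and compares it with the value one would expect if every byte were drawn uniformly at random. In that case each counter is binomial with p = 1/256, which gives CV ≈ sqrt(255/N) for a file of N bytes, or about 16/sqrt(N). For random data the CV therefore still shrinks as the file grows, which is one reason to compare files of the same size.

```python
import math
import os


def coefficient_of_variation(data: bytes) -> float:
    """CV = sigma / mean over the 256 per-byte counters."""
    counts = [0] * 256
    for value in data:
        counts[value] += 1
    mean = len(data) / 256
    if mean == 0:
        raise ValueError("empty input")
    sigma = math.sqrt(sum((c - mean) ** 2 for c in counts) / 256)
    return sigma / mean


def expected_uniform_cv(size: int) -> float:
    """Approximate CV if all `size` bytes were uniformly random."""
    return math.sqrt(255 / size)


if __name__ == "__main__":
    size = 35000  # sample size mentioned in the post
    measured = coefficient_of_variation(os.urandom(size))
    print(f"measured CV: {measured:.2%}")
    print(f"expected CV: {expected_uniform_cv(size):.2%}")
```

For 35000 bytes the expected value is roughly 8.5%. For a file of 750709501 bytes it drops to roughly 0.06%.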

The file gplv3.txt before encryption has this output:

circle-gplv3-plaintext

Blue characters indicate bytes that are absent from the sample. As it is a plain text file, the set of bytes used is small and contiguous, so the representation resembles a centred ring. As it is English text, entropy isn’t high either: human language has very well-known repetitive patterns, and some characters are used far more than others (sigma and CV have very large values).
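The observation that plain text uses only a narrow band of byte values is easy to check. The Python sketch below, with hypothetical function names, lists the byte values that never occur in a sample and reports the lowest and highest values that do.

```python
def absent_bytes(data: bytes) -> list[int]:
    """Byte values (0..255) that never occur in data."""
    present = set(data)
    return [value for value in range(256) if value not in present]


def present_span(data: bytes) -> tuple[int, int]:
    """Lowest and highest byte value that occurs in data."""
    if not data:
        raise ValueError("empty input")
    return min(data), max(data)


if __name__ == "__main__":
    text = b"The quick brown fox jumps over the lazy dog.\n"
    low, high = present_span(text)
    print(f"{len(absent_bytes(text))} byte values absent")
    print(f"bytes present lie between 0x{low:02x} and 0x{high:02x}")
```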

But is the encrypted file’s entropy really high? How well does it compare with other types of files? Let’s compare big files of the same size:

A RAR-compressed archive with random content:

circle-archive-rar

A file filled with random bytes from /dev/urandom:

pi@raspbmc:~$ dd if=/dev/urandom of=random.bin bs=1 count=750709501
750709501+0 records in
750709501+0 records out
750709501 bytes (751 MB) copied, 40300.1 s, 18.6 kB/s
pi@raspbmc:~$ circle random.bin

circle-urandom

And a zero-filled file encrypted with triops. Let’s create the zeroes as a sparse file (it is 72 bytes smaller to make room for the bytes that triops adds when encrypting, for the IV and the hashed password hint: see triops’ post for an explanation of these concepts):

pi@raspbmc:~$ dd if=/dev/zero of=blank.bin bs=1 count=0 seek=750709429
0+0 records in
0+0 records out
0 bytes (0 B) copied, 8.6996e-05 s, 0.0 kB/s
pi@raspbmc:~$ triops _circle\!_ blank.bin = 3
pi@raspbmc:~$ circle blank.bin.\$#3

circle-blank-bin-encrypted

Well, CV and sigma for the encrypted file seem very similar to those of a random source of bits (CV=0.1% in both cases), and not to those of merely compressed or otherwise crafted content. Moreover, the deviations are well distributed across the sample, whereas the compressed file looked like a calm sea with some well-defined peaks. This conclusion is reassuring from a cryptographic point of view: the encryption seems entropically strong.

In case you’re wondering, a zero-filled file has this circle:

circle-blank-bin

The program can also show a second circle centred not on byte 0x00 but on any other value; by default it centres on byte 127:

circle-gplv3-encrypted-two-circles

Options for non-colored consoles (in this case chars represent increments of 0.5 sigma and zero is char ‘*’) and for using numbers instead of ASCII art are also available:

$ circle

Show statistics about bytes contained in a file,
as a circle graph of deviations from sigma.
Use:
$ circle <filename> [0|1=no color,2=numbers,3=uncoloured numbers] [0-255=two circles!]
